Structural alignment of plain text books

Authors

  • André Santos
  • José João Almeida
  • Nuno Ramos Carvalho
Abstract

Text alignment is one of the main processes for obtaining parallel corpora. When aligning two versions of a book, results are often affected by unpaired sections, that is, sections which exist in only one of the versions. We developed Text::Perfide::BookSync, a Perl module which performs book synchronization (structural alignment based on section delimitation), provided the books have previously been annotated by Text::Perfide::BookCleaner. We discuss the need for such a tool and several implementation decisions. The main functions are described, and examples of input and output are presented. Text::Perfide::PartialAlign is an extension of the partialAlign.py tool bundled with hunalign which proposes alternative methods for splitting bitexts.
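
To make the idea of structural alignment with unpaired sections concrete, here is a minimal sketch in Python. It is not the algorithm used by Text::Perfide::BookSync, and it does not read the annotation format produced by Text::Perfide::BookCleaner. It assumes each book has already been reduced to a list of normalized section labels (the labels and the function name `sync_sections` are hypothetical), and it uses the standard-library `difflib.SequenceMatcher` as a stand-in for the matching step.

```python
from difflib import SequenceMatcher


def sync_sections(left, right):
    """Pair section labels from two versions of a book.

    Returns a list of (left_label, right_label) tuples. None on either
    side marks an unpaired section, i.e. one found in only one version.
    """
    # autojunk=False keeps difflib from discarding frequent labels.
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical labels in the same relative order: paired sections.
            pairs.extend(zip(left[i1:i2], right[j1:j2]))
        else:
            # 'delete', 'insert' and 'replace' all mean the labels differ,
            # so every section in these ranges is reported as unpaired.
            pairs.extend((label, None) for label in left[i1:i2])
            pairs.extend((None, label) for label in right[j1:j2])
    return pairs


# Hypothetical section labels for two versions of the same book.
version_a = ["preface", "chapter 1", "chapter 2", "chapter 3"]
version_b = ["chapter 1", "chapter 2", "chapter 3", "appendix"]

pairs = sync_sections(version_a, version_b)
for left_label, right_label in pairs:
    print(left_label, "<->", right_label)
```

In this sketch the preface and the appendix come out as unpaired, while the three chapters are matched one to one. Only the matched pairs would then be passed to a sentence aligner such as hunalign, so the unpaired material cannot disturb the alignment.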

Publication date: 2012